The dataset of Kickstarter campaigns was downloaded from Kaggle (source).
It contains information on over 300,000 Kickstarter projects launched between 2009 and 2017. The dataset contains variables such as name, category, main_category, launched, deadline, state, backers, country, usd_pledged_real and usd_goal_real.
# Loading Data, choosing columns
library(tidyverse) # read_csv, the dplyr verbs and ggplot2 used throughout
df <- read_csv('ks-projects-201801.csv')
df <- df %>%
  select(name, category, main_category, launched, deadline,
         state, backers, country, usd_pledged_real, usd_goal_real)
# Checking NA's in the dataset
df %>%
  summarise_all(~ sum(is.na(.))) # lambda form; funs() is deprecated
# there seem to be no NA's (only in name, which is irrelevant for this research)
# Factorise categorical variables
df$main_category <- as.factor(df$main_category)
df$category <- as.factor(df$category)
df$state <- as.factor(df$state)
df$country <- as.factor(df$country)
## vars n mean sd median trimmed mad
## name* 1 378657 187974.22 108445.68 188057.00 187990.38 139271.00
## category* 2 378661 81.74 45.13 88.00 82.31 56.34
## main_category* 3 378661 8.51 3.90 8.00 8.70 4.45
## launched 4 378661 NaN NA NA NaN NA
## deadline 5 378661 NaN NA NA NaN NA
## state* 6 378661 2.66 1.13 2.00 2.68 0.00
## backers 7 378661 105.62 907.19 12.00 28.84 17.79
## country* 8 378661 19.85 6.27 23.00 21.32 0.00
## usd_pledged_real 9 378661 9058.92 90973.34 624.33 2082.19 925.63
## usd_goal_real 10 378661 45454.40 1152950.06 5500.00 9399.97 6671.70
## min max range skew kurtosis se
## name* 1.00 375755 375754 0.00 -1.20 176.23
## category* 1.00 159 158 -0.06 -1.23 0.07
## main_category* 1.00 15 14 -0.24 -0.80 0.01
## launched Inf -Inf -Inf NA NA NA
## deadline Inf -Inf -Inf NA NA NA
## state* 1.00 6 5 0.43 -0.96 0.00
## backers 0.00 219382 219382 86.76 13954.68 1.47
## country* 1.00 23 22 -1.70 1.30 0.01
## usd_pledged_real 0.00 20338986 20338986 82.19 11796.26 147.84
## usd_goal_real 0.01 166361391 166361391 78.22 7082.76 1873.64
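The summary table above looks like output from `psych::describe()`; since the chunk is hidden, here is a sketch of how it was presumably produced:

# Assumes the psych package; describe() reports n, mean, sd, median,
# trimmed mean, mad, min, max, range, skew, kurtosis and se for every
# column (factors are marked with *)
library(psych)
describe(df)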
It will not be needed for this research question, but let's check the number of categories out of curiosity:
## [1] "number of specific categories of Kickstarter projects: 159"
## [1] "Number of general categories of Kickstarter projects: 15"
It seems that most of the campaigns either failed or succeeded. I will merge canceled and suspended campaigns into failed and discard all other campaigns to obtain a binary success/failure class.
df <- df %>%
  mutate(state = case_when(
    state == 'successful' ~ 'successful',
    state %in% c('failed', 'suspended', 'canceled') ~ 'failed'
  ))
df <- df %>% filter(state %in% c('successful', 'failed'))
df %>%
  group_by(state) %>%
  summarise(n = n()) %>%
  mutate(percentage = paste0(round(100 * n / sum(n), 2), '%')) %>% # no hardcoded total
  ggplot(aes(x = '', y = n, fill = state)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) + theme_void() +
  geom_text(aes(y = n, label = percentage), color = "white", size = 3,
            position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values = c('#2bde73', '#2bd9de')) +
  ggtitle("Distribution of successful and failed projects")
Great! Now we have a binary variable for the campaigns' final state: success / failure.
Now, let's analyse the distributions of the numerical variables.
First, let's plot and analyse the distributions of goals for failed and successful projects, together with their pledged amounts and numbers of backers.
The distribution of backers in successful projects seems to be normal or close to normal, whereas the distribution in failed projects seems positively skewed, which makes sense if unsuccessful campaigns usually attract fewer backers.
The goal variable also looks roughly normal, so we should test its normality and check whether there is a statistically significant difference between the mean goals and pledges of failed and successful projects, to answer our research question.
Normality tests for usd_goal_real, usd_pledged_real and backers: Anderson-Darling test + QQ plot
### usd_goal_real
## [1] "usd_goal_real: Anderson-Darling Normality Test p_value: 3.7e-24"
## [1] "usd_goal_real (only successful projects): Anderson-Darling Normality Test p_value: 3.7e-24"
## [1] "usd_goal_real (only failed projects): Anderson-Darling Normality Test p_value: 3.7e-24"
usd_goal_real is not normally distributed, as in all cases the p-value is smaller than 0.05.
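A sketch of how these normality tests were presumably run, assuming the `nortest` package (`ad.test()` handles large samples, unlike `shapiro.test()`, which is limited to 5000 observations; nortest also reports a floor p-value of 3.7e-24 for very large test statistics, which would explain the identical values above):

library(nortest)
p <- ad.test(df$usd_goal_real)$p.value
print(paste('usd_goal_real: Anderson-Darling Normality Test p_value:',
            signif(p, 2)))
# and likewise on the subsets:
p_s <- ad.test(df$usd_goal_real[df$state == 'successful'])$p.value
p_f <- ad.test(df$usd_goal_real[df$state == 'failed'])$p.value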
## [1] "usd_pledged_real: Anderson-Darling Normality Test p_value: 3.7e-24"
## [1] "usd_pledged_real (only successful projects): Anderson-Darling Normality Test p_value: 3.7e-24"
## [1] "usd_pledged_real (only failed projects): Anderson-Darling Normality Test p_value: 3.7e-24"
usd_pledged_real is not normally distributed, as in all cases the p-value is smaller than 0.05.
### backers
## [1] "backers: \n Anderson-Darling Normality Test p_value: 3.7e-24"
## [1] "Backers (only successful projects): Anderson-Darling Normality Test backers: 3.7e-24"
## [1] "Backers (only failed projects): Anderson-Darling Normality Test p_value: 3.7e-24"
None of goal, pledged amount, or backers is normally distributed, either overall or within groups.
It seems that there are significant outliers in all three variables (goal, pledged and backers). Let's remove these rows from the dataset.
# removing usd_pledged_real outliers (|z-score| > 3)
library(outliers) # scores() computes z-scores by default
pledged_outlier_scores <- scores(df$usd_pledged_real)
df[pledged_outlier_scores > 3 | pledged_outlier_scores < -3, 'usd_pledged_real'] <- NA
# removing usd_goal_real outliers
real_outlier_scores <- scores(df$usd_goal_real)
df[real_outlier_scores > 3 | real_outlier_scores < -3, 'usd_goal_real'] <- NA
# removing backers outliers
backers_outlier_scores <- scores(df$backers)
df[backers_outlier_scores > 3 | backers_outlier_scores < -3, 'backers'] <- NA
# checking for NA's (outliers)
#df %>%
# summarise_all(funs(sum(is.na(.))))
# Dropping rows containing NA values
dim1 <- nrow(df)
df <- df %>% drop_na()
dim2 <- nrow(df)
paste('Dropped', dim1 - dim2, 'outliers')
## [1] "Dropped 2589 outliers"
First, let's see violin plots of the three variables grouped by state.
Before comparing the group means, it is important to check whether the variances of the groups are equal.
## [1] "Levene Test for usd_goal_real variable: Value - 4279.81741093291 ;P - 0"
## [1] "Levene Test for usd_pledged_real variable: Value - 33502.3089465996 ;P - 0"
## [1] "Levene Test for backers variable: Value - 37010.2056946422 ;P - 0"
In all cases, the p-value of the Levene test is very close to 0 (smaller than 0.05), which means that the variances are not equal. This is an important insight before conducting a t-test: it motivates using Welch's t-test, which does not assume equal variances.
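The Levene tests were presumably produced with `car::leveneTest()` (an assumption; note that its default centers groups at the median, the Brown-Forsythe variant):

library(car)
lev <- leveneTest(usd_goal_real ~ as.factor(state), data = df)
print(paste('Levene Test for usd_goal_real variable: Value -',
            lev$`F value`[1], ';P -', lev$`Pr(>F)`[1]))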
##
## Welch Two Sample t-test
##
## data: df$usd_goal_real by df$state
## t = 92.395, df = 248355, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group failed and group successful is not equal to 0
## 95 percent confidence interval:
## 23377.89 24391.22
## sample estimates:
## mean in group failed mean in group successful
## 32031.682 8147.127
##
## Welch Two Sample t-test
##
## data: df$usd_pledged_real by df$state
## t = -164.56, df = 140512, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group failed and group successful is not equal to 0
## 95 percent confidence interval:
## -11908.41 -11628.08
## sample estimates:
## mean in group failed mean in group successful
## 1444.783 13213.030
##
## Welch Two Sample t-test
##
## data: df$backers by df$state
## t = -179.5, df = 137880, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group failed and group successful is not equal to 0
## 95 percent confidence interval:
## -152.1321 -148.8457
## sample estimates:
## mean in group failed mean in group successful
## 17.48571 167.97460
In all cases, the t-test yielded a p-value lower than 0.05, which means that for every variable the group means (successful / failed) are not equal.
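The tests above can be reproduced with base R's `t.test()`; with `var.equal = FALSE` (the default) it performs Welch's t-test, which is appropriate here because Levene's test rejected equal variances:

t.test(usd_goal_real ~ state, data = df)    # Welch test by default
t.test(usd_pledged_real ~ state, data = df)
t.test(backers ~ state, data = df)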
Knowing that the group means differ for every analysed numeric variable, a visualization of the means (including confidence intervals) would be insightful.
That's very interesting: failed campaigns tend to have significantly bigger goals, but collect less money and attract fewer backers than campaigns that achieve success.
Let's check one last thing: how close campaigns get to their goal (over or under it). On average, how far above or below the goal were campaigns? The value will be expressed as the fraction of the goal that was pledged (calculated by dividing usd_pledged_real by usd_goal_real).
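A minimal sketch of this calculation (the column name `pledged_ratio` and the normal-approximation confidence interval are my assumptions):

df <- df %>%
  mutate(pledged_ratio = usd_pledged_real / usd_goal_real)
df %>%
  group_by(state) %>%
  summarise(mean_ratio = mean(pledged_ratio),
            ci = 1.96 * sd(pledged_ratio) / sqrt(n()))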
This is interesting as well! It seems that successful campaigns not only have higher pledges and lower goals than failed campaigns, but they also significantly exceed their goal (mean=7.41, CI=1.82), whereas failed campaigns are on average not even close to their goals (mean=0.29, CI=0.18).
The analysis of the means of campaign goal, pledged money and number of backers, using two-sided Welch's t-tests for independent groups, revealed that successful Kickstarter campaigns launched between 2009 and 2017 had significantly lower goals (mean=8147, CI=74.7) than failed campaigns (mean=32032, CI=501). However, successful campaigns had significantly higher average pledged money (mean=13213, CI=138) and average number of backers (mean=168, CI=1.62) than failed campaigns, which had lower average pledged money (mean=1445, CI=25.0) and number of backers (mean=17.5, CI=0.25). Additionally, successful campaigns raised on average 741% of their goal (mean=7.41, CI=1.82), whereas failed campaigns raised on average 28.6% of the established goal (mean=0.286, CI=0.18).
These results made me think more about the dataset. I came up with another hypothesis that can be tested: variables such as usd_goal, date of launch, campaign length (30/60 days), category, main_category, and country / region have a relationship with whether a campaign is successful or not.
We already know the distribution of the numerical data, so analysing it again is not needed. Also, outliers have already been dropped. What's more, the data is already preprocessed (the state variable was changed to binary).
“Film and Video” is the most frequent project category, whereas “Dance” is the least frequent. I'm curious whether there is a relationship between the frequency of a category and its success rate.
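The per-category frequency and success rate can be computed along these lines (a sketch; the hidden chunk may differ):

df %>%
  group_by(main_category) %>%
  summarise(n = n(),
            success_rate = mean(state == 'successful')) %>%
  arrange(desc(success_rate))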
Hmm… Interesting: the least frequent categories, like Dance, Theater or Comics, have the highest success rates, above 50%! That might indicate a relationship between category (and its frequency) and the state of the project (success / failure).
Let's check which category has the most generous backers, or rather, which category received on average the biggest amount of money from backers.
Category “Dance” is the most successful, but it also has low mean goals and relatively low mean pledged amounts. On the other hand, Technology is the least successful, even though it has the highest mean goal and very high mean pledged amounts. This might indicate a relationship between main category and campaign success.
And again, Dance stands out: this time the analysis shows that it was the most successful project category from 2010 to 2014. I wonder if Kickstarter became popular due to the Dance category…
I wonder how the success rate was changing over years and which categories were most popular over those years.
It seems that the average success rate fell drastically after 2013 and hadn't returned to its prior level by 2017. I wonder what could cause this decrease. Maybe the number of new projects every year? More projects, more failures?
WOW! It seems that the annual number of projects and the annual success rate might be correlated!
## [1] "Pearson's correlation value: -0.840287151309151 ;p-value: 0.00456868893598974"
Pearson's correlation p-value is lower than 0.05, which indicates a significant negative correlation (coefficient -0.84) between the annual number of projects and the annual success rate! That's a very interesting insight about how Kickstarter functions! #Pattern #Spotted :D
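A sketch of the yearly aggregation and correlation test, assuming `lubridate` for extracting the launch year:

library(lubridate)
yearly <- df %>%
  mutate(year = year(launched)) %>%
  group_by(year) %>%
  summarise(n_projects = n(),
            success_rate = mean(state == 'successful'))
ct <- cor.test(yearly$n_projects, yearly$success_rate)
print(paste("Pearson's correlation value:", ct$estimate,
            ';p-value:', ct$p.value))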
Now, let's check if there is any monthly trend.
We can notice a pattern: there are significant decreases in success rate in July, December and January. That might indicate a relationship between the month of the year and project success. Also, let's check whether the average number of projects per month has any relationship with the success rate.
From these two graphs it is difficult to distinguish a relationship; let's calculate the correlation, just in case :D
## [1] "Pearson's correlation value: 0.300896677207677 ;p-value: 0.341927140320416"
Pearson's correlation between monthly success rate and monthly number of projects does not indicate a significant relationship, as the p-value is 0.34 (> 0.05). This means there might be seasonality in the success rate (across months), but it cannot be explained by the monthly number of launched projects.
Kickstarter allows users to run funding for one month or two; let's check whether this has any influence on campaign success.
First, let's calculate the length of each campaign.
df['deadline'] <- as.Date(df$deadline)
df['launched'] <- as.Date(df$launched)
df['length_days'] <- as.numeric(df$deadline - df$launched)
#max(df$length_days)
df <- df %>% # removing an outlier
  filter(length_days != 14867)
df <- df %>% # new column: campaign length rounded to whole months
  mutate(
    length_months = round(length_days / 30)
  )
Let's check the distribution of successful and failed campaigns.
Most projects have a one-month campaign. We can see that the success ratio of one-month campaigns is better than that of projects with 1.5 or 2 months of campaigning. This indicates that there might be a significant relationship between campaign length and success.
p <- df %>%
filter(length_months <= 2 & length_months > 0) %>%
mutate(
state_en = case_when(
state == 'failed' ~ 0,
state == 'successful' ~ 1)) %>%
group_by(length_months) %>%
summarise(success_ratio = mean(state_en),
n=n()) %>%
arrange(desc(success_ratio))
datatable(p)
One-month campaigns seem to have a significantly higher success rate. However, the classes are highly imbalanced.
One last thing to check in this dataset is countries. Let's start with the distribution of projects among countries.
It seems that the vast majority of projects come from the USA. I believe that grouping countries into regions will help to slightly reduce this imbalance :D
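A hypothetical mapping of country codes to regions; the exact grouping used in the original analysis is unknown:

df <- df %>%
  mutate(region = case_when(
    country %in% c('US', 'CA', 'MX') ~ 'North America',
    country == 'GB' ~ 'Great Britain',
    country %in% c('AU', 'NZ') ~ 'Oceania',
    country %in% c('HK', 'SG', 'JP') ~ 'Asia',
    country %in% c('DE', 'FR', 'IT', 'ES', 'NL', 'SE', 'DK',
                   'NO', 'IE', 'BE', 'AT', 'CH', 'LU') ~ 'Europe',
    TRUE ~ 'Other'
  ))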
Looks a little bit better! Let's see the distribution of successes and failures of campaigns among the regions.
A graph like this is not really clear and it's hard to draw any conclusions. Let's try to show the success rate per country in the form of a table:
According to this analysis, the US has not only the highest number of campaigns, but also the highest success rate. On the other hand, European countries (excluding GB) have the smallest success rates. This might indicate a relationship between success and region; however, the distribution is imbalanced and the sample from the US is significantly bigger than any other, which makes it difficult to assess whether such a relationship exists.
A next step would be to conduct qualitative research into the possible causes of the relationships indicated in this analysis.
It was really long, but I hope that at least some of the insights are useful!